AIDI 1002: Machine Learning Programming — Assignment - 2

Due Date : November 23, 2022, 11:59 PM

Fangji Chen

1. Consider this dataset from Kaggle (download the dataset from the following link: https://www.kaggle.com/shrutimechlearn/step-by-step-kmeans-explained-in-detail/data) and answer the following questions:

1.1 Perform k-means clustering over this dataset using Manhattan distance as the distance-measure. (10 Points)
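scikit-learn's `KMeans` only supports Euclidean distance, so a common workaround is a small NumPy implementation that assigns points by Manhattan (L1) distance and updates each centroid with the per-dimension median, which minimizes total L1 distance (the k-medians variant). A minimal sketch, run on synthetic data standing in for the Kaggle file (which is not bundled here); the demo cluster locations are illustrative assumptions:

```python
import numpy as np

def kmeans_manhattan(X, k, n_iter=100, seed=0):
    """k-means-style clustering under Manhattan (L1) distance.

    Assignment uses L1 distance; the centroid update uses the
    per-dimension median, which is the L1-optimal center (k-medians).
    """
    rng = np.random.default_rng(seed)
    # initialize centroids as k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # L1 distance from every point to every centroid
        dists = np.abs(X[:, None, :] - centroids[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        # median update per cluster; keep old centroid if a cluster empties
        new_centroids = np.array([
            np.median(X[labels == j], axis=0) if np.any(labels == j)
            else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Synthetic 2-D stand-in for the Kaggle dataset (5 blobs)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5], [5, 0], [2.5, 2.5])])
labels, centroids = kmeans_manhattan(X, k=5)
```

The same function can be applied to the real Kaggle columns once the CSV is loaded.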

1.2 After performing k-means clustering, extract the groups or clusters and add a separate column in your dataset as ‘Labels’ and fill it with cluster number assigned by k-means algorithm. (5 Points)
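Attaching the cluster assignments as a `Labels` column is a one-line pandas operation. A minimal sketch with a tiny illustrative DataFrame (the column names mimic the Kaggle file but are assumptions, not the actual data):

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the Kaggle dataset
df = pd.DataFrame({
    "Annual Income (k$)": [15, 16, 17, 80, 85, 86],
    "Spending Score (1-100)": [39, 81, 6, 77, 26, 75],
})

# e.g. the labels returned by the k-means step above
labels = np.array([0, 1, 0, 2, 3, 2])

# add the cluster number assigned to each row as a new column
df["Labels"] = labels
print(df["Labels"].value_counts().to_dict())
```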

1.3 Now, you should be ready with your labelled dataset. Perform a standard classification task using logistic regression, decision trees, random forest, and the Naive Bayes algorithm. (25 Points)
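The four classifiers can be trained and scored in one loop. A sketch using `make_blobs` as a stand-in for the k-means-labelled dataset (the real features and labels would replace `X` and `y`):

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the clustered Kaggle data: 5 labelled blobs
X, y = make_blobs(n_samples=500, centers=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}

# fit each model and record its held-out accuracy
scores = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
print(scores)
```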

1.4 Compare the performance of these supervised learning algorithms and comment on the homogeneity of the clusters, i.e., whether the clusters or groups make sense. (10 Points)


1.4.1. Analysis of the supervised learning algorithms

Logistic regression (LR) achieved the highest average performance, with an accuracy of 0.9700. Therefore, LR is the most suitable model.



1.4.2. Analysis of the homogeneity of the k-means clusters

As the silhouette scores show, k=5 is the best option.


1.4.2.1. The silhouette score suggests k=5 clusters

1.4.2.2. The elbow plot does not show a significant elbow, so I take k=5, as suggested by the silhouette score, to be the optimal number of clusters.
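Both diagnostics come from a single sweep over candidate k values: the silhouette score (here computed with the Manhattan metric, to match the clustering distance) and the within-cluster inertia used for the elbow plot. A sketch on synthetic data, since the Kaggle file is not bundled; note `KMeans` itself still minimizes Euclidean inertia:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the Kaggle data
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.8, random_state=7)

sil, inertia = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    # silhouette evaluated under the Manhattan metric
    sil[k] = silhouette_score(X, km.labels_, metric="manhattan")
    # inertia (sum of squared distances) for the elbow plot
    inertia[k] = km.inertia_

best_k = max(sil, key=sil.get)
print(best_k, sil[best_k])
```

Plotting `inertia` against k gives the elbow curve; the k with the largest silhouette score is taken as optimal.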

1.4.2.3. Visualization of the k-means clusters with the Manhattan distance metric


Visualization helps in understanding the homogeneity of the k-means clusters.
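A scatter plot colored by cluster label is the simplest such visualization. A sketch, again on synthetic stand-in data (the real dataset would need two chosen feature columns for the axes):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headlessly
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the Kaggle data
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.8, random_state=7)
labels = KMeans(n_clusters=5, n_init=10, random_state=7).fit_predict(X)

# one color per cluster; tight, well-separated colors suggest homogeneity
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=15)
plt.title("k-means clusters (k=5)")
plt.savefig("kmeans_clusters.png")
```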

As shown below, k=5 gives the best-separated clusters. This is further evidence reinforcing the conclusion drawn from the silhouette score.

In conclusion, the optimal k is 5.


2. Consider the breast_cancer dataset given in the sklearn library and answer the following questions.

2.1 Import the breast_cancer dataset from sklearn.datasets library. (5 Points)
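Loading the dataset is a single call to `sklearn.datasets.load_breast_cancer`:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

# 569 samples, 30 features, binary target (malignant / benign)
print(X.shape, list(data.target_names))
```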

2.2 Perform PCA (2 components) and LDA (1 component) over the dataset. (20 Points)

2.2.1. Perform LDA

2.2.2. Perform PCA
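Both projections are one-liners in scikit-learn. Note that LDA is supervised (it needs `y`) and, for this binary problem, can produce at most n_classes − 1 = 1 component; PCA is unsupervised. A sketch (whether to standardize the features first is a choice the assignment leaves open):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_breast_cancer(return_X_y=True)

# LDA: supervised, limited to n_classes - 1 = 1 component here
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

# PCA: unsupervised, 2 components as the question asks
X_pca = PCA(n_components=2).fit_transform(X)

print(X_lda.shape, X_pca.shape)
```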

2.3 Visualise the components and see if they are able to segregate the class labels in the breast_cancer dataset. (10 Points)


LDA performs best at segregating the class labels in the breast_cancer dataset, using a single component.


2.3.1. LDA visualization


Since the LDA output has only one component, I concatenate another variable with it so the result can be plotted more clearly in two dimensions.
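One way to pair the single LDA component with a second variable is to add a random jitter axis, which spreads the 1-D projection vertically so overlapping points become visible; this jitter choice is an assumption about what the second variable should be, not the only option (the class label or a zero column would also work):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headlessly
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_breast_cancer(return_X_y=True)
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

# jitter axis: pure visualization aid, carries no information
jitter = np.random.default_rng(0).uniform(-0.5, 0.5, size=len(y))

plt.scatter(X_lda[:, 0], jitter, c=y, cmap="coolwarm", s=10)
plt.xlabel("LD1")
plt.title("Breast cancer samples on the single LDA component")
plt.savefig("lda_1d.png")
```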


2.3.2. PCA visualization on breast cancer dataset

2.4 What is the maximum variance explained by both the components in PCA and LDA? (10 Points)


The first component of PCA explains 98.2% of the variance, while the single LDA component explains 100.0% of the between-class variance.
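These numbers come directly from the fitted estimators' `explained_variance_ratio_` attributes (available for LDA with the default `svd` solver). The ~98.2% figure holds for the unscaled data, where the large-magnitude area features dominate the variance:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_breast_cancer(return_X_y=True)

pca = PCA(n_components=2).fit(X)
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)

# unscaled data: the first PC explains roughly 98.2% of the variance
print(pca.explained_variance_ratio_)
# one discriminant axis for two classes, so its ratio is exactly 1.0
print(lda.explained_variance_ratio_)
```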


2.5 Comment on the working of PCA and LDA and which one is better for breast_cancer dataset. (5 Points)


As the plots show, LDA performs much better: its first component captures 100% of the class-discriminatory information. By comparison, the first principal component of PCA does not separate the classes as well, because the objective of PCA is to find the orthogonal basis that best explains the sample variance, not the class separability.

Comparing the explained-variance ratios of PCA and LDA:

Top components of LDA: [1.], which means the single component captures 100% of the class separability in the breast cancer dataset.

Top components of PCA: [0.98204467, 0.01617649, 0.00155751]; sum(pca.explained_variance_ratio_) = 99.98%, which means these top three components explain as much as 99.98% of the variance of the original breast cancer samples.

The principal components obtained from PCA are different from the components obtained from LDA.

LDA components capture class separability, and their ratios reflect how much of the classification-relevant information each component carries. PCA components explain the variance of the data, so the top components carry the most information about it: the higher the eigenvalue, the more information the corresponding eigenvector carries.